Introduction to Machine Learning via Tidymodels

Surprise

You have already done machine learning in the class

\(~\)

Linear regression from the lake ice model

\(~\)

fit <- lm(doy ~ year, data = sunapee)

Prediction vs. understanding

  • Modeling for understanding
    • Parameters and model matter
    • e.g., is the parameter that relates temperature to growth different from zero?
  • Modeling for prediction
    • Quality of prediction matters
    • Machine learning!

Machine learning

Machine learning is a field of inquiry devoted to understanding and building methods that “learn” – that is, methods that leverage data to improve performance on some set of tasks. - Wikipedia

Broad classes of ML

Supervised

  • Labels are available for each set of predictors (class or continuous numeric)

  • Goal is to predict the labels

Unsupervised

  • Labels are NOT available for each set of predictors.

  • Goal is to use predictors to create groups

Broad classes of ML

Supervised

  • Examples:
    • Images are labeled by a human (criminal vs. non-criminal)
    • Values are provided for a continuous variable (i.e., stream nitrate)

Unsupervised

  • Examples:
    • Recommendation systems - if you like a movie, ML can find movies like it to recommend
    • What are the characteristics of different groups that buy a product?

ML Flow chart

Big picture steps in ML

  1. Define question (what are you predicting…is it ethical?)
  2. Obtain data
  3. Define the type of ML (supervised vs. unsupervised, regression vs. classification).
  4. Identify method that will be used (this influences how data will be pre-processed)
  5. Pre-process data (also called feature engineering)
  6. Define specific approach for applying the model (i.e., which R package)

Big picture steps in ML

  1. Define data splits (training vs. testing)
  2. Fit (train) model to training data
  3. Evaluate (validate) the model with testing data (need to evaluate with testing data because ML learns the training data)

Big picture steps in ML

  1. Deploy model (predict new data).

Lake Ice module as ML

  1. Define question: what is the lake ice in 2030?
  2. Obtain data: read_excel
  3. Define the type of ML: supervised regression
  4. Identify method that will be used: linear regression
  5. Pre-process data: I already converted the date to DOY
  6. Define specific approach for applying the model: lm

Lake Ice module as ML

  1. Define data splits: we didn’t do this
  2. Fit (train) model to training data: lm(doy ~ year, data = sunapee)
  3. Evaluate (validate) the model with testing data: we didn’t do this

Lake Ice module as ML

  1. Deploy model (predict new data): predicted for year 2030

Tidyverse take on ML

Tidymodels (meta-package)

Tidymodels

The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles.

\(~\)

Whether you are just starting out today or have years of experience with modeling, tidymodels offers a consistent, flexible framework for your work.

Tidymodels

Overview of module

Datasets

  • Forest carbon from NEON (from prior module)
  • Lake Ice (from prior module)

Plan

  • Instruction in tidymodels applied to predicting carbon
  • Assignment part 1: apply to predicting lake ice
  • Instruction in tuning ML models
  • Assignment part 2: apply predicting to forest carbon

Focal dataset

Predicting the mean vegetation carbon stocks for each plot in the NEON data

Columns

 [1] "plotID"     "daylength"  "precip"     "range"      "tavg"      
 [6] "solar"      "tmax"       "tmin"       "vpd"        "elevation" 
[11] "nlcdClass"  "lat"        "long"       "plotType"   "ndvi"      
[16] "siteID"     "plot_kgCm2"

Number of rows

[1] 942

Challenge

Predict out-of-sample data

  • You have the “predictors” for 48 additional sites
  • I have the carbon stocks
  • You will submit your predictions of these 48 via Canvas
  • I will summarize and compare the results.